
Monitoring and displaying job and queue status

This section gives examples of how to monitor and display job, queue, and PBS server status.

Please note that the command listings shown here are only examples. The configuration changes from time to time, and the listings would otherwise have to be updated fairly often. They resemble the output on the present Louhi system, but, for example, some of the queue names and certain properties may no longer be valid. You can always get the current listings with the commands shown. In particular, the queuing system configuration is currently being revised, and the final or stable configuration may differ considerably from the present one. See also chapter Submitting jobs: qsub.

Job status

qstat

Examples of the command qstat, which displays job information, are given here. For a full description, see the manual page qstat(1B). For information about the options, see also the subsection Status of jobs, queues and batch server: qstat.

qstat

Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
78478.sdb         prog128d20       user1             00:00:00 R smem-xt5
78574.sdb         prog128d16       user1             00:00:00 R smem-xt5
78597.sdb         my_prog          user2             00:00:04 R small-xt5
78669.sdb         simul            user3             00:00:00 R smem-xt5
...

The output of the command qstat without options shows:

  • Job id: the job identifier assigned by PBS
  • Name: the job name given by the submitter or formed automatically
  • User: the job owner
  • Time Use: the CPU time used
  • S: the job state:
    E - job is exiting after having run
    H - job is held
    Q - job is queued, eligible to run or routed
    R - job is running.
    T - job is being moved to new location.
    W - job is waiting for its execution time (-a option) to be reached.
    S - job is suspended.
  • Queue: the queue in which the job resides

See also the command cqstat below.
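The columns and state codes described above can be handled programmatically. The following is a minimal Python sketch, not part of PBS; the function name parse_qstat_line is hypothetical, and it assumes that job names contain no whitespace so that each job line splits into exactly six fields.

```python
# Job state codes as listed above for the S column of qstat.
JOB_STATES = {
    "E": "exiting after having run",
    "H": "held",
    "Q": "queued, eligible to run or routed",
    "R": "running",
    "T": "being moved to a new location",
    "W": "waiting for its execution time to be reached",
    "S": "suspended",
}


def parse_qstat_line(line):
    """Split one job line of plain qstat output into named fields.

    Assumption: no field contains whitespace, so there are six fields.
    """
    job_id, name, user, time_use, state, queue = line.split()
    return {
        "job_id": job_id,
        "name": name,
        "user": user,
        "time_use": time_use,
        "state": state,
        "queue": queue,
    }


# One job line copied from the example listing above.
sample = "78597.sdb         my_prog          user2            00:00:04 R small-xt5"
job = parse_qstat_line(sample)
print(job["job_id"], "is", JOB_STATES[job["state"]])
```

The two header lines of the real output would have to be skipped before calling the function.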

qstat -a

sdb: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
78478.sdb       user1    smem-xt5 prog128d2    7480   1   1    --  72:00 R 61:23
78574.sdb       user1    smem-xt5 prog128d16  23778   1   1    --  72:00 R 51:41
78597.sdb       user2    small-xt my_prog     13761   1   1    --  72:00 R 50:15
78669.sdb       user3    smem-xt5 simul       23946   1   1    --  72:00 R 46:35

....

The output of the command qstat -a (and also with any of the options -i, -r, -u, -n, -s, -G or -M) shows:

  • Job ID: the job identifier assigned by PBS
  • Username: the job owner
  • Queue: the queue in which the job resides
  • Jobname: the job name
  • SessID: the session id (if the job is running)
  • NDS:  the number of chunks or nodes requested by the job (not shown correctly)
  • TSK: the number of CPUs requested by the job (not shown correctly)
  • Req'd Memory: the amount of memory requested by the job
  • Req'd Time: either the CPU time, if specified, or the wall time requested by the job (hh:mm)
  • S: the job's current state (see above)
  • Elap Time: the amount of CPU time or wall time used by the job (hh:mm).

See the command cqstat described below. It shows the number of PEs (cores), the number of cores for each PE and its threads (depth) and the number of PEs per node.

qstat -u ...

Information about one job or all jobs of a particular user can be displayed with commands of the following type:

qstat -u user1 3779
qstat -u user1

The first command shows the job 3779 (sequence number) of the user user1 and the second one all jobs of that user.
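The two command forms above differ only in the optional job number. As an illustration, the following Python sketch builds the argument list for either form; the helper name qstat_user_command is hypothetical and the sketch only prints the command lines, it does not run qstat.

```python
import shlex


def qstat_user_command(user, job_id=None):
    """Return the argument list for 'qstat -u USER [JOB]'."""
    argv = ["qstat", "-u", user]
    if job_id is not None:
        # Restrict the listing to one job (sequence number) of the user.
        argv.append(str(job_id))
    return argv


# The two example commands shown above.
print(shlex.join(qstat_user_command("user1", 3779)))
print(shlex.join(qstat_user_command("user1")))
```

On a system where qstat is installed, the returned list could be passed to subprocess.run to execute the command.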

qstat -f | less

qstat -f 3792 | less

The command qstat -f shows a lot of information about all jobs, and the form qstat -f 3792 about a particular job (here 3792). This includes, e.g., the resources used so far:

    resources_used.cpupercent = 0
    resources_used.cput = 00:00:00
    resources_used.mem = 6236kb
    resources_used.ncpus = 1
    resources_used.vmem = 29748kb
    resources_used.walltime = 00:55:47

and resources reserved for the job:

    Resource_List.mpparch = XT
    Resource_List.mppmem = 950mb
    Resource_List.mppnppn = 2
    Resource_List.mppwidth = 256
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.place = pack
    Resource_List.select = 1
    Resource_List.walltime = 04:00:00
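The "name = value" lines printed by qstat -f are easy to collect into a dictionary. The sketch below is only an illustration in Python; the function name parse_attributes is hypothetical. It assumes that every attribute starts on a line containing " = " and that any other non-empty line is the continuation of a long value wrapped by qstat.

```python
def parse_attributes(text):
    """Collect 'name = value' lines of qstat -f style output into a dict.

    Assumption: a line without ' = ' continues the previous value.
    """
    attrs = {}
    last = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            name, value = line.split(" = ", 1)
            attrs[name] = value
            last = name
        elif last is not None:
            # Wrapped value: append it to the previous attribute.
            attrs[last] += line
    return attrs


# A few lines copied from the example listings above.
sample = """
    resources_used.cput = 00:00:00
    resources_used.mem = 6236kb
    resources_used.walltime = 00:55:47
    Resource_List.mppwidth = 256
    Resource_List.walltime = 04:00:00
"""
attrs = parse_attributes(sample)
print("used", attrs["resources_used.walltime"],
      "of", attrs["Resource_List.walltime"])
```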

cqstat

cqstat is a filter script for the command qstat -f that shows selected parts of its extensive output. The result contains some information that is not shown correctly by the command qstat -a (see above). In particular, cqstat correctly shows the number of PEs (cores in pure MPI jobs) (MPP Width), the number of cores for each PE and its threads (MPP Depth), and the number of PEs per node (MPP NPPN):

                                           Time               MPP   MPP  MPP
Job ID      User     Jobname  S Queue      Queued   Walltime Width Depth NPPN
----------- -------- -------- - ---------  -------- -------- ----- ----- ----
78478.sdb   user1    prog128  R smem-xt5   13:06:52 61:05:12    64          8
78574.sdb   user1    prog128  R smem-xt5   03:25:20 51:23:47    64          8
78597.sdb   user2    my_prog  R small-xt5  01:58:57 49:57:25   256          8
78669.sdb   user3    XXyz     R smem-xt5   22:45:33 46:16:50    64          8
78704.sdb   user4    simul    R smem-xt5   20:43:28 42:57:19   128          8
...
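From MPP Width and MPP NPPN one can estimate how many nodes a job occupies. The following Python sketch assumes that the node count is the number of PEs divided by the number of PEs per node, rounded up, and that MPP Depth does not change this; the function name nodes_needed is hypothetical.

```python
def nodes_needed(mpp_width, mpp_nppn):
    """Estimate the node count as ceil(width / nppn) with integer arithmetic."""
    if mpp_width < 1 or mpp_nppn < 1:
        raise ValueError("width and nppn must be positive")
    return -(-mpp_width // mpp_nppn)


# (MPP Width, MPP NPPN) pairs taken from the example listing above.
jobs = {"78478.sdb": (64, 8), "78597.sdb": (256, 8), "78704.sdb": (128, 8)}
for job_id, (width, nppn) in jobs.items():
    print(job_id, "occupies about", nodes_needed(width, nppn), "nodes")
```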

Other commands

There are several commands starting with xt that show, among other things, information about jobs on the computing nodes; see Resource monitoring.


Queue status

Please note that the queue names, their configuration (e.g., maximum limits) and purpose may change from what is shown here. You can always find out the current configuration by the commands explained below.

qstat

qstat -Q

Queue           Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
special           1     0  no yes     0     0     0     0     0     0 Exec
ops               0     0  no yes     0     0     0     0     0     0 Exec
workq             0     0  no yes     0     0     0     0     0     0 Exec
xt4lgmem          0     0 yes yes     0     0     0     0     0     0 Exec
xt4               0     6 yes yes     4     2     0     0     0     0 Exec
xt5               0     5 yes yes     2     3     0     0     0     0 Exec
all               0     0 yes yes     0     0     0     0     0     0 Exec
xt5smlmem         0     2 yes yes     2     0     0     0     0     0 Exec
xt5lgmem          0     0 yes yes     0     0     0     0     0     0 Exec

The queues special, ops and workq do not exist on Louhi; they serve here as examples of disabled queues. The queue parallel is here a routing queue from which jobs, e.g., those in the hold state, are moved to an execution queue depending on the resources requested.

The command qstat -Q shows the available queues and their status:

  • Queue: the queue name
  • Max: the maximum number of jobs that may be run in the queue concurrently (0 = not defined)
  • Tot: the total number of jobs in the queue
  • Ena: the enabled (yes) or disabled (no) status of the queue
  • Str: the started (yes) or stopped (no) status of the queue
  • Que, Run, Hld, Wat, Trn, Ext (same as Q, R, H, W, T, E above, respectively): for each job state, the number of jobs in the queue in that state
  • Type: the type of queue, execution or routing.
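The qstat -Q columns described above can be read into a small table, for example to list the disabled queues or to sum the running jobs. The following Python sketch is only an illustration; the function name parse_queue_row is hypothetical, and it assumes that each data line has exactly the twelve whitespace-separated columns of the header.

```python
COLUMNS = ["Max", "Tot", "Ena", "Str", "Que", "Run", "Hld", "Wat", "Trn", "Ext"]


def parse_queue_row(line):
    """Turn one data line of qstat -Q output into a dict."""
    fields = line.split()
    row = {"Queue": fields[0], "Type": fields[-1]}
    for name, value in zip(COLUMNS, fields[1:-1]):
        # Ena and Str are yes/no flags; the other columns are counts.
        row[name] = value if name in ("Ena", "Str") else int(value)
    return row


# Three data lines copied from the example listing above.
sample = """\
special 1 0 no yes 0 0 0 0 0 0 Exec
xt4               0     6 yes yes     4     2     0     0     0     0 Exec
xt5               0     5 yes yes     2     3     0     0     0     0 Exec
"""
rows = [parse_queue_row(line) for line in sample.splitlines()]
disabled = [row["Queue"] for row in rows if row["Ena"] == "no"]
running = sum(row["Run"] for row in rows)
print("disabled queues:", disabled)
print("running jobs:", running)
```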

qstat -q

This command shows the status of queues in an alternative format:

server: sdb
Queue            Memory CPU Time Walltime Node   Run   Que   Lm  State
---------------- ------ -------- -------- ---- ----- ----- ----  -----
xt4lgmem           --      --    12:00:00  --      0     0   --   E R
xt4                --      --    12:00:00  --      2     4   --   E R
xt5                --      --    12:00:00  --      3     2   --   E R
all                --      --       --     --      0     0   --   E R
xt5smlmem          --      --    12:00:00  --      0     2   --   E R
xt5lgmem           --      --    12:00:00  --      0     0   --   E R
                                               ----- -----
                                                   5     8

  • Queue: the queue name
  • Memory: the maximum amount of memory a job in the queue may request
  • CPU Time: the maximum amount of CPU time a job in the queue may request
  • Walltime: the maximum amount of wall time a job in the queue may request
  • Node: the maximum number of nodes a job in the queue may request
  • Run: the number of jobs in the queue in the running state
  • Que: the number of jobs in the queue in the queued state
  • Lm: the maximum number (limit) of jobs that may be run in the queue concurrently
  • State: The state of the queue is given by a pair of letters:
    • E R - Enabled Running (started)
    • E S - Enabled Stopped
    • D R - Disabled Running (started)
    • D S - Disabled Stopped

The output shows that the maximum walltime has been defined for the executing queues, but no limits are set for maximum memory or CPU time. 
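The State letter pair and the "--" placeholders of the qstat -q listing can be decoded as follows. This Python sketch is an illustration only; the function name parse_queue_limits_row is hypothetical, and it assumes that each data line has the ten whitespace-separated fields shown in the listing above, with "--" meaning that no limit is set.

```python
def parse_queue_limits_row(line):
    """Decode one data line of qstat -q output."""
    fields = line.split()
    enabled_flag, started_flag = fields[-2], fields[-1]
    return {
        "queue": fields[0],
        # "--" means that no walltime limit is defined.
        "walltime": None if fields[3] == "--" else fields[3],
        "running": int(fields[5]),
        "queued": int(fields[6]),
        "enabled": enabled_flag == "E",   # E = enabled, D = disabled
        "started": started_flag == "R",   # R = running (started), S = stopped
    }


# Two data lines copied from the example listing above.
xt4 = parse_queue_limits_row(
    "xt4                --      --    12:00:00  --      2     4   --   E R")
all_queue = parse_queue_limits_row(
    "all                --      --       --     --      0     0   --   E R")
print(xt4)
print(all_queue)
```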

qstat -Q -f | less

This shows the full status of all queues, including default resources (resource limits) and other resource restrictions. For a specific queue, add the queue name, e.g., xt5:

qstat -Q -f xt5

Queue: xt5
    queue_type = Execution
    total_jobs = 5
    state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:3 Exiting:0 Begun
        :0
    resources_max.mpparch = XT
    resources_max.mppnodes = 384-479,768-991,1280-1375,1408-1503,1792-2015
    resources_max.mppnppn = 8
    resources_max.walltime = 12:00:00
    resources_min.mpparch = XT
    resources_min.mppnodes = 384-479,768-991,1280-1375,1408-1503,1792-2015
    resources_default.mpparch = XT
    resources_default.mppnppn = 8
    resources_assigned.ncpus = 3
    resources_assigned.nodect = 3
    kill_delay = 10
    enabled = True
    started = True

This shows that there may also be minimum and maximum limits for certain resources. The maximum walltime in this queue is 12 hours. The default and maximum number of PEs per node (mppnppn) is 8 (in XT4 queues it is 4). The listing also shows which nodes are available when the queue is used.
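The mppnodes value in the listing is a comma-separated list of node ranges. To see how many nodes such a list covers, the ranges can be summed as in the following Python sketch; the function name count_nodes is hypothetical, and it assumes that each item is either a single node number or an inclusive range "first-last".

```python
def count_nodes(spec):
    """Count the nodes in a list such as '384-479,768-991'."""
    total = 0
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            # Both ends of a range are included.
            total += int(last) - int(first) + 1
        else:
            total += 1
    return total


# Node list copied from resources_max.mppnodes in the listing above.
xt5_nodes = "384-479,768-991,1280-1375,1408-1503,1792-2015"
print("nodes available to the queue:", count_nodes(xt5_nodes))
```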

Server status

qstat

qstat -B

This shows the available PBS servers and their status:

Server             Max   Tot   Que   Run   Hld   Wat   Trn   Ext Status
---------------- ----- ----- ----- ----- ----- ----- ----- ----- -----------
sdb                  0    13     8     5     0     0     0     0 Active

Because there is only one PBS server on Louhi, you can get its full status with any of the following commands:

qstat -B -f

qstat -B -f sdb or qstat -B -f nid00003

Server: sdb
    server_state = Active
    server_host = nid00003
    scheduling = True
    total_jobs = 13
    state_count = Transit:0 Queued:8 Held:0 Waiting:0 Running:5 Exiting:0 Begun
        :0
    managers = root@nid00008,root@boot001,ui14@nid00008
    default_queue = parallel
    log_events = 511
    mail_from = root
    query_other_jobs = True
    resources_default.mppmem = 1000mb
    resources_default.mppwidth = 128
    resources_default.walltime = 04:00:00
    default_chunk.ncpus = 1
    resources_max.mppmem = 2000mb
    resources_assigned.ncpus = 5
    resources_assigned.nodect = 5
    scheduler_iteration = 600
    flatuid = True
    FLicenses = 9463
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_file_location = 7788@nid00128
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 3600
    license_count = Avail_Global:9462 Avail_Local:1 Used:5 High_Use:6
    pbs_version = PBSPro_9.2.2.82426
    eligible_time_enable = False

The server resource limits shown here apply to all queues, unless different limits have been set for the queues themselves (see the qstat -Q -f command above).
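The state_count attribute in the listing above summarizes the jobs by state, and its counts add up to total_jobs. The following Python sketch splits such a value into a dictionary; the function name parse_state_count is hypothetical, and it assumes that a value wrapped over two lines by qstat has already been joined into one string.

```python
def parse_state_count(value):
    """Turn 'Transit:0 Queued:8 ...' into a dict of integer counts."""
    counts = {}
    for item in value.split():
        state, number = item.split(":")
        counts[state] = int(number)
    return counts


# Value copied from the server listing above, with the wrapped line joined.
state_count = ("Transit:0 Queued:8 Held:0 Waiting:0 "
               "Running:5 Exiting:0 Begun:0")
counts = parse_state_count(state_count)
print(counts)
print("total jobs:", sum(counts.values()))
```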

This shows that (at the time of writing) the default wall time is four hours (4:00:00 or 14400 s, -l walltime=4:00:00). If you don't need that much, you should request a shorter time, and if you need more, you must request more. The default memory per PE is 1000 MB (-l mppmem=1000M). The maximum memory per PE is set here to 2000 MB, but the settings of the queues themselves may override it. Again, if you need less, request less, and if you need more, request more.
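The relation between the hh:mm:ss form and seconds, and the form of the -l options mentioned above, can be illustrated with a short Python sketch. The function names walltime_to_seconds and resource_options are hypothetical; the sketch only builds option strings and does not submit anything.

```python
def walltime_to_seconds(walltime):
    """Convert 'hh:mm:ss' into seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def resource_options(walltime, mppmem):
    """Build the -l options for a wall time and a memory per PE request."""
    return ["-l", f"walltime={walltime}", "-l", f"mppmem={mppmem}"]


# Values taken from the text above.
print(walltime_to_seconds("4:00:00"), "seconds")
print(" ".join(resource_options("4:00:00", "1000M")))
```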

General usage policy and queue arrangements are described in section Usage policy and obtaining a user id.