
Batch Jobs

梁勇
2023-12-01


commands: qsub, qstat, qdel

Within the alibaba cluster, the batch queuing system Torque is used. Torque is an open-source resource manager that provides control over batch jobs running on the compute nodes.

The most important commands are qsub for submitting a job, qstat for monitoring its status, and qdel for deleting a job.

For descriptions of these and other related commands:

qalter, pbs_alterjob, pbs_statjob, pbs_statque, pbs_statserver, pbs_submit, pbs_job_attributes, pbs_queue_attributes, pbs_server_attributes, pbs_resources_*

see http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki or the corresponding man-pages.

 

qsub

The qsub command is usually called with the filename of a script as its parameter. That script holds the job parameters as well as the call of the actual programme. Parameters are placed in comment lines ("#") at the top of the script; each starts with the directive prefix "#PBS" (no space between "#" and "PBS"), followed by the parameter setting, e.g.

...
#PBS -l walltime=6:10:00
...

to set the maximum amount of real time during which the job can be in the running state. Parameters can also be specified as command-line arguments, e.g.

> qsub -l nodes=12

to request 12 nodes. Command line arguments take precedence over parameters set in the script.
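The precedence rule can be tried with a small script. The following is a minimal sketch, not a site-specific recipe: the file name demo_job.sh and the job name demo are hypothetical, and the script is only submitted if qsub is actually installed.

```shell
#!/bin/sh
# Write a small job script. File name and job name are hypothetical.
job_script="${TMPDIR:-/tmp}/demo_job.sh"
cat > "$job_script" <<'EOF'
#!/bin/sh
#PBS -N demo
#PBS -l walltime=6:10:00
#PBS -l nodes=1
echo "submitted from $PBS_O_WORKDIR"
EOF

# Show the directives that qsub reads from the script.
grep '^#PBS' "$job_script"

# The command-line value nodes=12 takes precedence over nodes=1 in the script.
if command -v qsub >/dev/null 2>&1; then
    qsub -l nodes=12 "$job_script"
else
    echo "qsub not found; would run: qsub -l nodes=12 $job_script"
fi
```

Keeping defaults in the script and overriding single values on the command line avoids editing the script for every run.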

Important options are:

Option           Description
-N name          job name
-q queue         destination queue
-S shell         path to the shell that interprets the job script
-j oe            join stdout and stderr streams
-o filename      file or directory to which the (joined) output is written
-m a             send an e-mail notification in case of a job abort
-M e-mail        e-mail address to which the notification is sent
-l resource(s)   see below

The important resource parameters are:

Resource   Format                                     Description
nodes      number-of-nodes[:ppn=processes-per-node]   Number and type of nodes to be reserved for exclusive use by the job
walltime   seconds or [[HH:]MM:]SS                    Maximum amount of real time during which the job can be in the running state
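Since walltime accepts either plain seconds or the [[HH:]MM:]SS form, it can help to see that both describe the same quantity. The helper below is an illustrative sketch; the function name walltime_to_seconds is made up for this example and is not part of Torque.

```shell
#!/bin/sh
# Convert a walltime given as seconds or [[HH:]MM:]SS into plain seconds.
# Each colon-separated field multiplies the running total by 60.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

walltime_to_seconds 6:10:00   # prints 22200
walltime_to_seconds 100       # prints 100
```

So "-l walltime=6:10:00" and "-l walltime=22200" request the same limit.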

At run time the following environment variables are set:

Variable name   Used for
PBS_JOBNAME     user specified jobname
PBS_O_WORKDIR   job's submission directory
PBS_TASKNUM     number of tasks requested
PBS_O_HOME      home directory of submitting user
PBS_O_LOGNAME   name of submitting user
PBS_JOBCOOKIE   job cookie
PBS_NODENUM     node offset number
PBS_O_SHELL     script shell
PBS_JOBID       unique pbs job id
PBS_O_HOST      host on which job script is currently running
PBS_QUEUE       job queue
PBS_NODEFILE    file containing line delimited list of nodes allocated to the job
PBS_O_PATH      path variable used to locate executables within job script
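A job script can read these variables like any other environment variable. The sketch below is only an illustration: the function names describe_job and list_nodes are invented here, and the fallback values after ":-" are assumptions added so that the snippet also runs in an ordinary shell outside a batch job.

```shell
#!/bin/sh
# Summarise the job from variables set by the batch system.
# The values after ":-" are fallbacks for use outside a job (assumption).
describe_job() {
    echo "job ${PBS_JOBNAME:-interactive} in queue ${PBS_QUEUE:-none}, submitted from ${PBS_O_WORKDIR:-$PWD}"
}

# List each allocated node once; outside a job, print the local host name.
list_nodes() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        sort -u "$PBS_NODEFILE"
    else
        uname -n
    fi
}

describe_job
list_nodes
```

A common use is "cd $PBS_O_WORKDIR" at the top of a job script, so that relative paths resolve against the directory from which the job was submitted.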

 

 

qstat

To monitor submitted jobs, the qstat command is used. Although not all job information is shown to a normal user, one can get information such as the job ID, name, and queuing status. The output can be given in different formats and levels of verbosity.

To get an overview in table form, type qstat without any argument.

Job id   Name                        User         Time Use        S                    Queue
<ID>     <Jobname> (given by user)   <Username>   Used CPU time   Status (see below)   <Job queue>

Status can be

C   Job is completed after having run
E   Job is exiting after having run
H   Job is held
Q   Job is queued, eligible to run or routed
R   Job is running
T   Job is being moved to new location
W   Job is waiting for its execution time (-a option) to be reached
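In scripts, the status letter can be picked out of the table and translated. The following is a rough sketch under the assumption that the table has the six columns described above, so that the status is the fifth whitespace-separated field of a job line. The function names and the sample table are invented for illustration; the sample is not real qstat output.

```shell
#!/bin/sh
# Print the status letter for one job from a qstat-style table on stdin.
# $1 is the numeric job id; it matches "101" as well as "101.<server>".
job_state() {
    awk -v id="$1" '$1 == id || index($1, id ".") == 1 { print $5 }'
}

# Translate a status letter into a short description.
describe_state() {
    case "$1" in
        C) echo "completed" ;;
        E) echo "exiting" ;;
        H) echo "held" ;;
        Q) echo "queued" ;;
        R) echo "running" ;;
        T) echo "being moved" ;;
        W) echo "waiting for its execution time" ;;
        *) echo "unknown" ;;
    esac
}

# Invented sample in the column layout described above.
sample='Job id   Name   User  Time Use  S  Queue
101.srv  test1  user  00:00:12  R  batch
102.srv  test2  user  0         Q  batch'

state=$(printf '%s\n' "$sample" | job_state 101)
echo "job 101 is $(describe_state "$state")"
```

On the cluster the same functions would read live data, for example "qstat | job_state 101".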

More detailed information can be requested by using the "-f" option:

> qstat -f [job_id]

For more information, see the man-page or the online man page of qstat at torque.

 

qdel

The qdel command is used to delete a job, which has to be specified by its job-identifier, that is, type

> qdel <job_id>

to delete the job with the id <job_id>. After the command is submitted, a "Delete Job batch request" is sent to the batch server that owns the job. See the man-page for more information.

 

 

examples

Simple examples are given for serial and parallel batch jobs.

 

serial programme

 

#PBS -N test1
#PBS -j oe
#PBS -o /home/user/test/test1.log
#PBS -l walltime=100
#
set -x
cd /work/user
/home/user/test/a.out
exit

The batch job executes /home/user/test/a.out in the directory /work/user and writes a log file to /home/user/test/test1.log.

 

parallel: MPI

 

#PBS -N test2
#PBS -j oe
#PBS -o /home/user/test/test2.log
#PBS -l nodes=3:ppn=8
#PBS -l walltime=100
#
set -x
cd /work/user
/home/user/test/a.out -np 24
exit

The job executes on 3 nodes using all 8 cores of each node. One has to make sure that the "-l nodes=...:ppn=..." and "-np ..." specifications match: here 3 nodes x 8 processes per node gives -np 24. In parallel jobs one should always request 8 cores per node (i.e. ppn=8). Otherwise one would share nodes with other users, which should be avoided.
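One way to keep the two specifications in step is to derive the process count inside the job instead of typing it twice. The sketch below rests on an assumption that should be checked on the cluster: that the file named by PBS_NODEFILE holds one line per requested core (each node name repeated ppn times), as is usual for Torque. The function name slots_in_nodefile is made up, and the programme is only echoed here, not started.

```shell
#!/bin/sh
# Count the non-empty lines of a node file.
# Assumption: one line per requested core, i.e. nodes x ppn lines in total.
slots_in_nodefile() {
    grep -c . "$1"
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    np=$(slots_in_nodefile "$PBS_NODEFILE")
else
    # Outside a batch job: fall back to the values of the example above.
    np=$((3 * 8))
fi

echo "would start: /home/user/test/a.out -np $np"
```

With this, changing "-l nodes=3:ppn=8" in the script header is enough, and the -np value follows.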

A special case is ppn=4 (or smaller). In that case one should also specify the interconnect one wants to use. This can be done by adding #PBS -q gbe for Gigabit Ethernet or #PBS -q ib for InfiniBand. (For ppn larger than 4 the system will automatically use the large nodes and InfiniBand.)

 

 

parallel: OpenMP

 

#PBS -N test3
#PBS -j oe
#PBS -o /home/user/test/test3.log
#PBS -l nodes=1:ppn=T
#PBS -l walltime=100
#
set -x
cd /work/user
export OMP_NUM_THREADS=T
/home/user/test/a.out
exit

Here T stands for the number of threads. In parallel jobs one should always request T = 8 cores per node (i.e. ppn=8). Otherwise one would share nodes with other users, which should be avoided. As a consequence, OMP_NUM_THREADS should be set to 8.

An alternative is to request 4 threads. In that case one should explicitly request small nodes by adding #PBS -q gbe.

