Submitting a batch request involves two basic processes: composing the shell script and submitting the batch request.
In addition, you can monitor the job request (with qstat) or delete the job request (with qdel).
The batch request is contained within a shell script. In its simplest form, the batch request consists of the commands that invoke your application. For example:
The shell used to execute the script is often your login shell, although the shell is determined by the NQS manager. The script will execute in your home directory, unless the script explicitly changes the directory (using cd, for example).
If you are unfamiliar with shell scripts, the standard OSF/1 documentation gives user and reference information on developing and invoking shell scripts. The qsub manual page provides more specific information on including qsub invocation flags in the shell script.
The batch request is contained within the shell script that you composed in the previous section. You submit the batch request to a batch queue using the NQS qsub command.
For example, assume that you have a program called myjob that you want to run on 16 nodes, and you have a shell script called job1 that runs the program. You would submit the job via a batch request to a queue of at least 16 nodes.
You might also have a second job that you want to send to the same queue. In the following example, two jobs (job1 and job2) have been queued to batch queue lanl.day:
The -q switch in the previous example specifies the queue that you are submitting the request to. If you don't use the -q switch, qsub will look for a default queue in your environment variable QSUB_QUEUE. If you leave the switch out and have not defined QSUB_QUEUE, NQS uses the default queue set up by the system administrator (using the qmgr command). If qsub cannot find a default queue, it will fail.
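The lookup order described above can be sketched as a small shell function. This is only an illustration of the order of precedence, not how qsub is implemented: resolve_queue is a hypothetical helper, and its second argument stands in for the default queue that the system administrator sets with qmgr.

```shell
# Sketch of the queue lookup order described in the text (hypothetical helper).
# $1 = queue given with -q (may be empty)
# $2 = stand-in for the administrator's default queue (may be empty)
resolve_queue() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"              # explicit -q value wins
    elif [ -n "${QSUB_QUEUE:-}" ]; then
        printf '%s\n' "$QSUB_QUEUE"     # then the QSUB_QUEUE variable
    elif [ -n "${2:-}" ]; then
        printf '%s\n' "$2"              # then the system default
    else
        echo "no default queue found" >&2
        return 1                        # qsub fails in this case
    fi
}
```

For instance, resolve_queue "" snl prints the value of QSUB_QUEUE if it is set, and snl otherwise.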
To see what queues are available, use the qstat command:
The -x switch in the previous example exports all of the user environment variables with the request. As shown in the previous example, you can immediately submit another request without waiting for the first request to finish. With all other parameters being equal, batch requests to the same queue execute in the order they are submitted.
In many cases, you will want to be notified when the request starts and finishes execution. Use the qsub -mb switch to get notification when the request starts; use the qsub -me switch to get notification when the request ends. For example:
The NQS system will now send you mail at the beginning (-mb) and end (-me) of the job.
If your application needs fewer nodes than the number of nodes allowed by the queue, you can use the qsub -lP switch to limit the number of nodes your application uses. The value specified must be less than or equal to the queue's node limit. For example, the following entry limits your application to twenty nodes:
When specifying the number of nodes, keep in mind that the amount of CPU resources consumed is proportional to the CPU time used multiplied by the number of nodes of the request. Limiting the number of nodes can save system resources.
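As a rough illustration of that proportionality, the following sketch multiplies CPU time by node count. The helper name node_seconds and the unit "node-seconds" are assumptions made for this example; the actual accounting on your system may use a different measure.

```shell
# node_seconds CPU_SECONDS NODES
# Prints CPU time multiplied by node count, a relative measure of consumption.
# (Hypothetical helper; not part of NQS.)
node_seconds() {
    echo $(( $1 * $2 ))
}

node_seconds 600 20     # ten minutes on 20 nodes
node_seconds 600 512    # the same ten minutes on 512 nodes
```

Comparing the two results shows how much a smaller -lP value can reduce the resources charged to a request that runs for the same length of time.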
The -lP switch is only available on Paragon systems. On a remote workstation, you can specify the number of nodes with the NCPUS environment variable, and then use the qsub -x switch to export the environment variable.
If your application doesn't need to run as long as the queue will allow, you should specify a shorter run time with the qsub -lT flag. Note that if you submit a job that requires more time than is left before the queue is stopped, your job will not start. For example:
The application will run for ten minutes, and you will receive a warning one minute before the job is killed.
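The examples in this guide give time limits both as plain seconds (-lT600,60) and in colon-separated form (-lT 8:00:00). The sketch below converts either form to seconds and, following the description above, works out how far into the run the warning would arrive. The helper names to_seconds and warning_at are hypothetical and are not NQS commands.

```shell
# to_seconds TIME
# Converts plain seconds or a colon-separated time (h:mm:ss) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# warning_at LIMIT WARN
# Seconds of run time before the warning, assuming the warning is sent
# WARN seconds before the limit is reached (as described in the text).
warning_at() {
    echo $(( $(to_seconds "$1") - $(to_seconds "$2") ))
}

to_seconds 8:00:00    # limit used in the template script
warning_at 600 60     # values from the -lT600,60 example
```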
Standard output and standard error messages are written to files named myjob.oNN and myjob.eNN, respectively, in the directory from which the request is submitted. The NN value is a job number assigned by NQS and shown when the qsub command executes.
Note that NQS uses only the first 5 characters of your script name. If you have a script named 'myjob123.nqs', the '.e' file will be named 'myjob.eNN'. Likewise, for a script named 'myjb.nqs', the '.e' file will be named 'myjb..eNN'.
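The naming rule above can be checked with a short sketch that builds both file names from a script name and a request number. The helper output_names is hypothetical; it only reproduces the first-5-characters rule as described in this section.

```shell
# output_names SCRIPT REQUEST_NUMBER
# Prints the expected standard output and standard error file names,
# keeping only the first 5 characters of the script name.
# (Hypothetical helper; not part of NQS.)
output_names() {
    prefix=$(printf '%.5s' "$1")
    printf '%s.o%s\n%s.e%s\n' "$prefix" "$2" "$prefix" "$2"
}

output_names myjob123.nqs 136
output_names myjb.nqs 137
```

The second call shows where the doubled period comes from: the fifth character of 'myjb.nqs' is itself a period.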
You can monitor queue status and the completion status of requests using the qstat command. For example:
You may occasionally need to delete a batch request after you have submitted it. The most direct way to delete a batch request is with the qdel command (use qdel -k to kill a job that has started running). Refer to the qdel command description for more information. In the following example, two jobs are checked and then deleted:
In general, a batch request cannot be modified by the user once it has been queued, but you can delete it (as described previously) and then resubmit a modified request.
Composing the Shell Script
% cat myapp.sh
#!/bin/csh
#Use below as a template for qsub command
#snl has big proc limit
#qsub -re -ro -q snl -lT 8:00:00 -lP 500 myapp.sh
date
#the line below tells NQS to cd to the directory from
# which you did your qsub when nqs runs your script
cd $QSUB_WORKDIR
yod -sz 500 myprog myoptions
Submitting the Batch Request
% qsub -q lanl.day -x job1
Request 136.janus submitted to queue: lanl.day.
Account = 0
% qsub -q lanl.day -x job2
Account = 0
Request 137.janus submitted to queue: lanl.day.
Specifying a Queue
% qstat -b
============================================
NQS Version: 2 BATCH QUEUES on janus
============================================
QUEUE NAME  STATUS   TOTAL  RUNNING  QUEUED  HELD  TRANSITION  NODE_GROUP
-----------------------------------------------------------------------------
snl         AVAILBL  2      0/10     2       0     0           node_grp_one
snl.day     AVAILBL  0      0/10     0       0     0           node_grp_one
llnl        AVAILBL  0      0/10     0       0     0           node_grp_one
llnl.day    AVAILBL  0      0/10     0       0     0           node_grp_one
lanl        AVAILBL  0      0/10     0       0     0           node_grp_one
lanl.day    AVAILBL  0      0/10     0       0     0           node_grp_one
express     STOPPED  0      0/10     0       0     0           node_grp_one
intel       AVAILBL  0      0/10     0       0     0           node_grp_one
hold        STOPPED  0      0/1      0       0     0           (NONE)
snl.big     AVAILBL  0      0/10     0       0     0           node_grp_one
llnl.big    AVAILBL  0      0/10     0       0     0           node_grp_one
lanl.big    AVAILBL  0      0/10     0       0     0           node_grp_one
snl.full    STOPPED  0      0/20     0       0     0           node_grp_one
edu         UNAVAIL  0      0/10     0       0     0           node_grp_one
edu.day     UNAVAIL  0      0/10     0       0     0           node_grp_one
edu.big     UNAVAIL  0      0/10     0       0     0           node_grp_one
To see the characteristics of all queues, include the qstat -b and -f flags. Note that this also lists who can submit jobs to that queue and the maximum number of nodes and time for jobs to that queue. To see the characteristics of a particular queue, add the queue name.
% qstat -b -f lanl.day
For lanl.day, the users who can submit include aaalph, kdaaaas, and others. The maximum number of nodes you can use is 512, and the maximum time you can use is 7200.0 seconds (that is, 7200 seconds per node).
=====================================================
NQS Version:2 BATCH QUEUE: lanl.day.janus status: AVAILBL
=====================================================
Priority: 1
ENTRIES:
Total: 0 Running: 0
Queued: 0 Held: 0 Transition: 0
EFFECTIVE PRIORITY LIMITS:
Non-prime time: 3
Prime time: 1
RUN_LIMIT:
Runlimit: 10
NODE_GROUP:
Node_group: node_grp_one
Nodes_prime: 3456
Nodes_nprime: 3456
COMPLEX MEMBERSHIP:
RESOURCES:
Per-proc core file size limit= UNLIMITED
Exporting Environment Variables
Getting Job Request Start/Finish Notification
% qsub -q snl.day -mb -me myapp
Account = 0
Request 127.prefect submitted to queue: snl.day
Limiting the Number of Nodes
% qsub -q snl.day -lP 20 myapp
NOTE
Limiting CPU Time Usage
% qsub -q snl.big -lT600,60 myapp
Finding Standard Output and Standard Error
Monitoring Request Execution
% qstat -a
===============================================================
NQS Version:2 BATCH PIPE REQUESTS on janus
===============================================================
REQUEST     NAME     OWNER    QUEUE    PRI  NICE  CPU    MEM     STATE
3139.janus  Asym2.n  pjaaabe  snl      1.7  0     40000  UNLIM.  QUEUED
3158.janus  Fccedge  pjaaabe  snl      1.0  0     28800  UNLIM.  QUEUED
3159.janus  submit   ljaaaen  snl.day  1.0  0     3600   UNLIM.  RUNNING
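If you check request status often, you can filter a listing like the one above by state. The following is a minimal sketch that assumes the column layout shown in this example, with the request ID in the first column and the state in the last. The helper requests_in_state is hypothetical, and real qstat output on your system may differ.

```shell
# requests_in_state STATE
# Reads qstat-style lines on standard input and prints the IDs of requests
# whose last column equals STATE. Header and separator lines are skipped
# because their first field does not look like "number.host".
# (Hypothetical helper; assumes the layout shown in this guide.)
requests_in_state() {
    awk -v want="$1" '$1 ~ /^[0-9]+\./ && $NF == want { print $1 }'
}

# Sample input copied from the listing in this section.
requests_in_state QUEUED <<'EOF'
REQUEST NAME OWNER QUEUE PRI NICE CPU MEM STATE
3139.janus Asym2.n pjaaabe snl 1.7 0 40000 UNLIM. QUEUED
3158.janus Fccedge pjaaabe snl 1.0 0 28800 UNLIM. QUEUED
3159.janus submit ljaaaen snl.day 1.0 0 3600 UNLIM. RUNNING
EOF
```

On a system with NQS installed, the same function could presumably be fed from the command itself, for example qstat -a | requests_in_state RUNNING.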
Deleting a Batch Request
% qstat
===============================================================
NQS Version:2 BATCH PIPE REQUESTS on janus
===============================================================
REQUEST     NAME     OWNER    QUEUE    PRI  NICE  CPU    MEM     STATE
3139.janus  Asym2.n  pjaaabe  snl      1.7  0     40000  UNLIM.  QUEUED
3158.janus  Fccedge  pjaaabe  snl      1.0  0     28800  UNLIM.  QUEUED
3159.janus  submit   pjaaabe  snl.day  1.0  0     3600   UNLIM.  RUNNING
% qdel 3139
Request 3139 has been deleted.
% qdel -k 3159
Request 3159 is running, and has been signalled.
% qstat
===============================================================
NQS Version:2 BATCH PIPE REQUESTS on janus
===============================================================
REQUEST     NAME     OWNER    QUEUE    PRI  NICE  CPU    MEM     STATE
3158.janus  Fccedge  pjaaabe  snl      1.0  0     28800  UNLIM.  QUEUED
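The choice between qdel and qdel -k in the example above depends on whether the request has started running. The sketch below prints the matching command for a given state so you can review it before running it. The helper qdel_command is hypothetical and does not delete anything itself.

```shell
# qdel_command REQUEST_ID STATE
# Prints (does not run) the qdel invocation suited to the request state:
# "qdel -k" for a running request, plain "qdel" otherwise.
# (Hypothetical helper; state names follow the qstat listings in this guide.)
qdel_command() {
    case "$2" in
        RUNNING) printf 'qdel -k %s\n' "$1" ;;
        *)       printf 'qdel %s\n' "$1" ;;
    esac
}

qdel_command 3139 QUEUED
qdel_command 3159 RUNNING
```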
Modifying a Batch Request