NQS Batch Requests

Submitting a batch request involves two basic processes: composing the shell script and submitting the batch request.

In addition, you can monitor the job request (with qstat) or delete the job request (with qdel).

Composing the Shell Script

The batch request is contained within a shell script. In its simplest form, the batch request consists of the commands that invoke your application. For example:


% cat myapp.sh
#!/bin/csh
#Use below as a template for qsub command
#snl has big proc limit
#qsub -re -ro -q snl -lT 8:00:00 -lP 500 myapp.sh
date
#the line below tells NQS to cd to the directory from
# which you did your qsub when nqs runs your script
cd $QSUB_WORKDIR
yod -sz 500 myprog myoptions

The shell used to execute the script is often your login shell, although the shell is determined by the NQS manager. The script will execute in your home directory, unless the script explicitly changes the directory (using cd, for example).

If you are unfamiliar with shell scripts, the standard OSF/1 documentation gives user and reference information on developing and invoking shell scripts. The qsub manual page provides more specific information on including qsub invocation flags in the shell script.

Submitting the Batch Request

The batch request is contained within the shell script that you composed in the previous section. You submit the batch request to a batch queue using the NQS qsub command.

For example, assume that you have a program called myjob that you want to run on 16 nodes, and you have a shell script called job1 that runs the program. You would submit the job via a batch request to a queue of at least 16 nodes.

You might also have a second job that you want to send to the same queue. In the following example, two jobs (job1 and job2) have been queued to batch queue lanl.day:


% qsub -q lanl.day -x job1
Request 136.janus submitted to queue: lanl.day.
Account = 0
% qsub -q lanl.day -x job2
Account = 0
Request 137.janus submitted to queue: lanl.day.
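
When qsub is called from another script, it can be handy to capture the request identifier for later use with qstat or qdel. The following is a minimal sketch, assuming qsub writes the "Request ... submitted" line to standard output in the form shown above; request_id is a hypothetical helper, not an NQS command.

```shell
# Hypothetical helper (not part of NQS): read qsub's output on stdin and
# print the request identifier from the "Request ... submitted" line.
request_id() {
    awk '$1 == "Request" && $3 == "submitted" { print $2 }'
}

# Sample text copied from the example above; on a real system you would
# pipe the output of qsub itself into request_id.
printf '%s\n' \
    'Request 136.janus submitted to queue: lanl.day.' \
    'Account = 0' | request_id
```

The helper matches on the words rather than on line position, so it does not matter whether the Account line comes before or after the Request line.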

Specifying a Queue

The -q switch in the previous example specifies the queue that you are submitting the request to. If you don't use the -q switch, qsub will look for a default queue in your environment variable QSUB_QUEUE. If you leave the switch out and have not defined QSUB_QUEUE, NQS uses the default queue set up by the system administrator (using the qmgr command). If qsub cannot find a default queue, it will fail.
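
The lookup order described above can be sketched as a small shell function. This only illustrates the order of precedence; resolve_queue and its three arguments are hypothetical and do not reflect how qsub is actually implemented.

```shell
# Hypothetical helper that mirrors the lookup order described above.
#   $1 = queue given with -q (empty if the switch was omitted)
#   $2 = value of QSUB_QUEUE (empty if not defined)
#   $3 = system default queue set by the administrator (empty if none)
resolve_queue() {
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$3" ]; then
        echo "$3"
    else
        echo "no default queue found" >&2
        return 1
    fi
}

resolve_queue "lanl.day" "snl" "express"   # the -q value takes precedence
resolve_queue "" "snl" "express"           # falls back to QSUB_QUEUE
resolve_queue "" "" "express"              # falls back to the system default
```

With all three arguments empty, the function prints an error and returns a nonzero status, matching the case in which qsub fails.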

To see what queues are available, use the qstat command:


% qstat -b
============================================
NQS Version: 2    BATCH QUEUES on janus
============================================
QUEUE NAME   STATUS    TOTAL  RUNNING  QUEUED  HELD  TRANSITION  NODE_GROUP
-----------------------------------------------------------------------------
snl          AVAILBL   2      0/10     2       0     0           node_grp_one
snl.day      AVAILBL   0      0/10     0       0     0           node_grp_one
llnl         AVAILBL   0      0/10     0       0     0           node_grp_one
llnl.day     AVAILBL   0      0/10     0       0     0           node_grp_one
lanl         AVAILBL   0      0/10     0       0     0           node_grp_one
lanl.day     AVAILBL   0      0/10     0       0     0           node_grp_one
express      STOPPED   0      0/10     0       0     0           node_grp_one
intel        AVAILBL   0      0/10     0       0     0           node_grp_one
hold         STOPPED   0      0/1      0       0     0           (NONE)
snl.big      AVAILBL   0      0/10     0       0     0           node_grp_one
llnl.big     AVAILBL   0      0/10     0       0     0           node_grp_one
lanl.big     AVAILBL   0      0/10     0       0     0           node_grp_one
snl.full     STOPPED   0      0/20     0       0     0           node_grp_one
edu          UNAVAIL   0      0/10     0       0     0           node_grp_one
edu.day      UNAVAIL   0      0/10     0       0     0           node_grp_one
edu.big      UNAVAIL   0      0/10     0       0     0           node_grp_one

To see the characteristics of all queues, use qstat with both the -b and -f flags. The output also lists who can submit jobs to each queue and the maximum number of nodes and time allowed for jobs in that queue. To see the characteristics of a particular queue, add the queue name:

% qstat -b -f lanl.day

=====================================================
NQS Version:2   BATCH QUEUE: lanl.day.janus   status: AVAILBL
=====================================================
  Priority: 1
  ENTRIES:
    Total: 0   Running: 0   Queued: 0   Held: 0   Transition: 0
  EFFECTIVE PRIORITY LIMITS:
    Non-prime time: 3   Prime time: 1
  RUN_LIMIT:
    Runlimit: 10
  NODE_GROUP:
    Node_group: node_grp_one   Nodes_prime: 3456   Nodes_nprime: 3456
  COMPLEX MEMBERSHIP:
  RESOURCES:
    Per-proc core file size limit= UNLIMITED
    Per-process data size limit  = UNLIMITED
    Per-proc perm file size limit= UNLIMITED
    Per-proc execution nice value= 0
    Per-req number of cpus limit = 512
    Per-user # of requests limit = 1
    Per-process stack size limit = UNLIMITED
    Per-process CPU time limit   = UNLIMITED
    Per-request CPU time limit   = 7200.0
    Per-process working set limit= UNLIMITED
  ACCESS
    Groups:
    Users: root kdaaaas aaalph dpbaaai raaaat taaalso gwb haaaaer jeaaa juaaaac qkluege msaaare raaaret aaa qqee jrraaat tuaaaas aab raa aaay kaa awaaaa gaaox jaaalte ahk waa

In this example, the users who can submit to lanl.day are those listed under ACCESS (kdaaaas, aaalph, and so on). A request can use at most 512 nodes (the per-request number of CPUs limit) and at most 7200.0 seconds of CPU time (the per-request CPU time limit). That is 7200 seconds per node.

Exporting Environment Variables

The -x switch in the earlier qsub example (qsub -q lanl.day -x job1) exports all of the user environment variables with the request. As that example also shows, you can immediately submit another request without waiting for the first request to finish. With all other parameters being equal, batch requests to the same queue execute in the order they are submitted.

Getting Job Request Start/Finish Notification

In many cases, you will want to be notified when the request starts and finishes execution. Use the qsub -mb switch to get notification when the request starts; use the qsub -me switch to get notification when the request ends. For example:


% qsub -q snl.day -mb -me myapp
Account = 0
Request 127.prefect submitted to queue: snl.day

The NQS system will now send you mail at the beginning (-mb) and end (-me) of the job.

Limiting the Number of Nodes

If your application needs fewer nodes than the number of nodes allowed by the queue, you can use the qsub -lP switch to limit the number of nodes your application uses. The value specified must be less than or equal to the queue's node limit. For example, the following entry limits your application to twenty nodes:


% qsub -q snl.day -lP 20 myapp

NOTE

When specifying the number of nodes, keep in mind that the CPU resources consumed are proportional to the CPU time used multiplied by the number of nodes in the request. Limiting the number of nodes can save system resources.

The -lP switch is only available on Paragon systems. On a remote workstation, you can specify the number of nodes with the NCPUS environment variable, and then use the qsub -x switch to export the environment variable.
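
The arithmetic in the note above can be made concrete with a short calculation. This is only a sketch of the proportionality the note describes; node_seconds is a hypothetical helper, and the actual accounting done by the system may differ.

```shell
# Hypothetical helper: CPU seconds multiplied by the number of nodes,
# following the proportionality stated in the NOTE above.
#   $1 = CPU time in seconds, $2 = number of nodes
node_seconds() {
    echo $(( $1 * $2 ))
}

node_seconds 600 20     # ten minutes on 20 nodes  -> 12000
node_seconds 600 500    # ten minutes on 500 nodes -> 300000
```

The same ten-minute run consumes 25 times more on 500 nodes than on 20, which is why requesting only the nodes you need matters.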

Limiting CPU Time Usage

If your application doesn't need to run as long as the queue will allow, you should specify a shorter run time with the qsub -lT flag. Note that a job that requires more time than is left before the queue is stopped will not start. For example, the following request limits the run time to 600 seconds:


% qsub -q snl.big -lT600,60 myapp

The application will run for ten minutes, and you will receive a warning one minute before the job is killed.
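
Time limits appear in this document both as plain seconds (600) and in colon-separated form (8:00:00 in the script template). The following sketch converts either form to seconds so the two can be compared. to_seconds is a hypothetical helper, and treating a two-field value as minutes:seconds is an assumption based on the usual [[hours:]minutes:]seconds convention.

```shell
# Hypothetical helper: convert [[hours:]minutes:]seconds to seconds.
# awk treats each field as a decimal number, so "08" is read as 8.
to_seconds() {
    printf '%s\n' "$1" |
        awk -F: '{ n = 0; for (i = 1; i <= NF; i++) n = n * 60 + $i; print n }'
}

to_seconds 8:00:00   # 28800
to_seconds 10:00     # 600
to_seconds 600       # 600
```

The 28800 result corresponds to the eight-hour limit in the script template, and 600 to the ten-minute limit in the example above.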

Finding Standard Output and Standard Error

Standard output and standard error messages are written to files named myjob.oNN and myjob.eNN, respectively, in the directory from which the request is submitted. The NN value is a job number assigned by NQS and shown when the qsub command executes.

Note that NQS uses only the first five characters of your script name. For a script named 'myjob123.nqs', the standard error file is named myjob.eNN. For a script named 'myjb.nqs', the first five characters are 'myjb.' (including the period), so the standard error file is named 'myjb..eNN'.
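
The naming rule can be sketched as follows, assuming the five-character truncation described above. nqs_output_names is a hypothetical helper that only predicts the file names; it does not query NQS.

```shell
# Hypothetical helper: predict the stdout/stderr file names for a request.
#   $1 = script name, $2 = job number reported by qsub
nqs_output_names() {
    # Keep only the first five characters of the script name.
    prefix=$(printf '%s\n' "$1" | cut -c1-5)
    echo "${prefix}.o$2 ${prefix}.e$2"
}

nqs_output_names myjob123.nqs 136   # myjob.o136 myjob.e136
nqs_output_names myjb.nqs 137       # myjb..o137 myjb..e137
```

The second call shows where the doubled period comes from: the truncated prefix already ends in a period before the .o or .e suffix is added.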

Monitoring Request Execution

You can monitor queue status and the completion status of requests using the qstat command. For example:


% qstat -a 
===============================================================
NQS Version:2    BATCH  PIPE REQUESTS on janus 
===============================================================
 REQUEST       NAME    OWNER      QUEUE      PRI  NICE CPU     MEM     STATE
 3139.janus    Asym2.n pjaaabe    snl        1.7    0  40000   UNLIM.  QUEUED
 3158.janus    Fccedge pjaaabe    snl        1.0    0  28800   UNLIM.  QUEUED
 3159.janus    submit  ljaaaen    snl.day    1.0    0  3600    UNLIM.  RUNNING
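
When the list is long, the qstat output can be filtered with standard tools. The following is a minimal sketch that assumes the column layout shown above, with the request identifier in the first column and the state in the last. requests_in_state and sample_qstat are hypothetical names, and the sample text stands in for real qstat output.

```shell
# Hypothetical helper: read qstat -a output on stdin and print the
# identifiers of requests whose last column equals the given state.
requests_in_state() {
    awk -v state="$1" '$1 ~ /^[0-9]+\./ && $NF == state { print $1 }'
}

# Sample data copied from the listing above; on a real system you would
# run: qstat -a | requests_in_state QUEUED
sample_qstat() {
    cat <<'EOF'
 REQUEST       NAME    OWNER      QUEUE      PRI  NICE CPU     MEM     STATE
 3139.janus    Asym2.n pjaaabe    snl        1.7    0  40000   UNLIM.  QUEUED
 3158.janus    Fccedge pjaaabe    snl        1.0    0  28800   UNLIM.  QUEUED
 3159.janus    submit  ljaaaen    snl.day    1.0    0  3600    UNLIM.  RUNNING
EOF
}

sample_qstat | requests_in_state QUEUED
sample_qstat | requests_in_state RUNNING
```

The pattern on the first column skips the banner and header lines, so only request rows are considered.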



Deleting a Batch Request

You may occasionally need to delete a batch request after you have submitted it. The most direct way to delete a batch request is with the qdel command (use qdel -k to kill a job that has started running). Refer to the qdel command description for more information. In the following example, two jobs are checked and then deleted:



% qstat
===============================================================
NQS Version:2    BATCH  PIPE REQUESTS on janus
===============================================================
 REQUEST       NAME    OWNER      QUEUE      PRI  NICE CPU     MEM     STATE
 3139.janus    Asym2.n pjaaabe    snl        1.7    0  40000   UNLIM.  QUEUED
 3158.janus    Fccedge pjaaabe    snl        1.0    0  28800   UNLIM.  QUEUED
 3159.janus    submit  pjaaabe    snl.day    1.0    0  3600    UNLIM.  RUNNING
% qdel 3139
Request 3139 has been deleted.
% qdel -k 3159
Request 3159 is running, and has been signalled.
% qstat
===============================================================
NQS Version:2    BATCH  PIPE REQUESTS on janus
===============================================================
 REQUEST       NAME    OWNER      QUEUE      PRI  NICE CPU     MEM     STATE
 3158.janus    Fccedge pjaaabe    snl        1.0    0  28800   UNLIM.  QUEUED
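
If you need to delete several requests, the qdel commands can be generated from the qstat listing and reviewed before you run them. The sketch below only prints the commands and assumes the column layout shown above (owner in the third column, state in the last). qdel_commands and sample_jobs are hypothetical names, and the sample text stands in for real qstat output.

```shell
# Hypothetical helper: read qstat output on stdin and print (not run)
# one qdel command per request owned by the given user. Running
# requests get "qdel -k", as described in the text above.
qdel_commands() {
    awk -v owner="$1" '
        $1 ~ /^[0-9]+\./ && $3 == owner {
            split($1, id, ".")              # 3139.janus -> 3139
            if ($NF == "RUNNING") print "qdel -k " id[1]
            else                  print "qdel " id[1]
        }'
}

# Sample data copied from the first listing above.
sample_jobs() {
    cat <<'EOF'
 REQUEST       NAME    OWNER      QUEUE      PRI  NICE CPU     MEM     STATE
 3139.janus    Asym2.n pjaaabe    snl        1.7    0  40000   UNLIM.  QUEUED
 3158.janus    Fccedge pjaaabe    snl        1.0    0  28800   UNLIM.  QUEUED
 3159.janus    submit  pjaaabe    snl.day    1.0    0  3600    UNLIM.  RUNNING
EOF
}

sample_jobs | qdel_commands pjaaabe
```

Printing the commands instead of executing them keeps the step reversible: nothing is deleted until you have read the list and run the lines you want.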

Modifying a Batch Request

In general, a batch request cannot be modified by the user once it has been queued, but you can delete it (as described previously) and then resubmit a modified request.

Acknowledgement and Disclaimer